The fundamental shift in high-performance computing is the move away from a CPU-centric, serial execution model toward a decoupled producer-consumer model, in which the CPU manages the flow of data while the GPU operates independently. The core realization is that the GPU is not designed to be driven as a strictly synchronous device; treating it as one creates "stop-and-wait" bottlenecks.
1. The Workflow Lifecycle
In the asynchronous mindset, developers do not wait for each task to finish. Instead, they allocate memory, launch kernels, and copy results back by placing non-blocking requests into a hardware queue.
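This lifecycle can be sketched in HIP as follows. The kernel name `scale`, the buffer sizes, and the launch geometry are illustrative assumptions, not part of the original lesson; error checking is omitted for brevity:

```cpp
#include <hip/hip_runtime.h>

// Illustrative kernel: doubles every element of a vector.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* hostBuf;
    hipHostMalloc(&hostBuf, bytes);   // pinned host memory, so async copies can truly overlap
    for (int i = 0; i < n; ++i) hostBuf[i] = 1.0f;

    float* devBuf;
    hipMalloc(&devBuf, bytes);        // allocate device memory

    hipStream_t stream;
    hipStreamCreate(&stream);         // the hardware queue we enqueue work into

    // Enqueue copy-in, kernel, copy-out: each call returns to the CPU immediately.
    hipMemcpyAsync(devBuf, hostBuf, bytes, hipMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(devBuf, n);
    hipMemcpyAsync(hostBuf, devBuf, bytes, hipMemcpyDeviceToHost, stream);

    // The CPU is free to do other work here; synchronize only before reading results.
    hipStreamSynchronize(stream);

    hipStreamDestroy(stream);
    hipFree(devBuf);
    hipHostFree(hostBuf);
    return 0;
}
```

Note that the three enqueue calls preserve order within the stream, so the kernel is guaranteed to see the copied-in data even though none of the calls block the host.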
2. Breaking the Stall
When the host is forced to synchronize after every operation, the execution gaps — the transfer time between the CPU and the GPU — become the dominant performance cost. By embracing asynchrony, the CPU keeps working while the GPU processes its stream, maximizing hardware saturation.
$$\text{Total Time} = \max(\text{CPU work}, \text{GPU work}) + \text{Sync overhead}$$
QUESTION 1
Which set of steps correctly converts a synchronous vector-add to use an explicit stream?
Call hipStreamCreate, use hipMemcpyAsync with the handle, and pass the handle as the 4th kernel launch-configuration parameter.
Call hipDeviceSynchronize after every kernel launch and use hipMemcpy.
Set the stream parameter to NULL in all hipMemcpyAsync calls.
Replace hipMalloc with hipHostMalloc exclusively.
✅ Correct!
Correct. Explicit streams require handle creation, async memory operations, and passing the handle to the kernel launch configuration.
❌ Incorrect
Using hipMemcpy (blocking) or the NULL stream (implicitly synchronous) defeats the purpose of the mindset shift.

QUESTION 2
Why is a GPU considered 'not meant to be driven as a strictly synchronous device'?
Because it has no internal clock.
Because waiting for the CPU to confirm every command leaves thousands of cores idle.
Because memory transfers cannot be tracked by the CPU.
Because the GPU must manage its own power state.
✅ Correct!
GPU efficiency comes from high-throughput parallel work; synchronizing after every small step creates 'dead air' on the hardware.
❌ Incorrect
The issue is latency and core utilization, not hardware clocking or power management.

QUESTION 3
What is the primary risk of forcing the host to synchronize after every operation?
Memory corruption.
Host-side stalling and loss of hardware saturation.
Increased power consumption on the GPU.
Kernel compile errors.
✅ Correct!
Synchronous calls block the CPU, preventing it from preparing the next 'chunk' of work for the GPU.
❌ Incorrect
While inefficient, it doesn't corrupt memory or cause compilation errors.

QUESTION 4
In the logistics warehouse analogy, what does the 'Conveyor Belt' represent?
A HIP Stream.
The GPU Driver.
The CPU Cache.
The VRAM buffer.
✅ Correct!
A stream acts like a conveyor belt, allowing the CPU to load tasks sequentially without waiting for the worker (GPU) to finish the current one.
❌ Incorrect
The stream is the FIFO queue mechanism that facilitates the non-blocking 'conveyor' flow.

QUESTION 5
True or False: hipMemcpyAsync returns control to the CPU before the data transfer is complete.
True
False
✅ Correct!
Yes! This is the definition of non-blocking. The CPU just enqueues the request and moves on.
❌ Incorrect
If it waited, it would be a standard synchronous hipMemcpy.

Case Study: The Warehouse Manager's Bottleneck
Asynchrony Implementation
A legacy ROCm application uses standard hipMemcpy and kernel launches without stream handles. The CPU utilization is 98%, but the GPU is only at 15% utilization because it waits for the CPU to finish logging data before starting the next copy.
Q
Explain how Asynchrony would fix this 'stop-and-wait' bottleneck.
Solution:
By using asynchrony, the CPU can enqueue the next data transfer and kernel launch to a HIP stream and immediately return to its logging tasks. This allows the GPU to process the stream in parallel with the CPU's logging, keeping the compute cores saturated.
Q
Provide the code required to create a stream and launch a kernel into it (replacing a default launch).
Solution:
hipStream_t myStream;
hipStreamCreate(&myStream);                    // create the explicit stream handle
myKernel<<<grid, block, 0, myStream>>>(args);  // the 4th launch-configuration parameter is the stream

Q
What function must be called to ensure the data is fully copied back to the host before the CPU accesses it?
Solution:
hipStreamSynchronize(myStream); must be called. This is the explicit 'handshake' that confirms all previous work in that specific stream is complete.
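A compact sketch of that handshake, as a self-contained program. The buffer names and sizes here are illustrative assumptions; the key point is that reading the destination buffer before the synchronize call would race with the in-flight copy:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 4 * sizeof(float);
    float host[4] = {0};
    float* dev;
    hipMalloc(&dev, bytes);
    hipMemset(dev, 0, bytes);          // give the device buffer a defined value

    hipStream_t s;
    hipStreamCreate(&s);

    // Enqueue the device-to-host copy; this call returns immediately.
    hipMemcpyAsync(host, dev, bytes, hipMemcpyDeviceToHost, s);
    // Reading host[] at this point would be a data race.

    hipStreamSynchronize(s);           // the explicit handshake: wait for the stream to drain
    std::printf("%f\n", host[0]);      // safe: the copy is guaranteed complete

    hipStreamDestroy(s);
    hipFree(dev);
    return 0;
}
```

hipStreamSynchronize waits only on the one stream, so other streams (and the CPU) can keep running; hipDeviceSynchronize, by contrast, drains the whole device.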